Amirali Soltani Tehrani
Python
Python is an interpreted, high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small- and large-scale projects.
Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.
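Both points show up in a few lines: a name can be rebound to a value of a different type, and the standard library covers common tasks without third-party packages. A minimal sketch (variable names are illustrative):

```python
import statistics  # "batteries included": statistics without extra installs

x = 42             # x is bound to an int
x = "forty-two"    # dynamic typing: the same name rebound to a str

# a stdlib helper used in a functional style
mean_value = statistics.mean([1, 2, 3, 4])
print(mean_value)  # 2.5
```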
Guido van Rossum began working on Python in the late 1980s, as a successor to the ABC programming language, and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000 and introduced new features such as list comprehensions, cycle-detecting garbage collection, reference counting, and Unicode support. Python 3.0, released in 2008, was a major revision that is not completely backward-compatible with earlier versions. Python 2 was discontinued with version 2.7.18 in 2020.
print('I love Machine Learning course.'.upper()+' (upper)')
print('I love Machine Learning course.'.rjust(20) + ' (rjust 20)')
print('i love Machine Learning course.'.capitalize()+ ' (capitalize)')
print(' I love Machine Learning course. '.strip()+ ' (strip)')
I LOVE MACHINE LEARNING COURSE. (upper)
I love Machine Learning course. (rjust 20)
I love machine learning course. (capitalize)
I love Machine Learning course. (strip)
ml_sem_code = 2022
print('I like ' + str(ml_sem_code) + ' a lot!')
print(f'{print} (print a function)')
print(f'{type(229)} (print a type)')
I like 2022 a lot!
<built-in function print> (print a function)
<class 'int'> (print a type)
txt = "For only {price:.2f} dollars!"
print(txt.format(price = 49))
For only 49.00 dollars!
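The same formatting can be written as an f-string, which evaluates expressions inline:

```python
price = 49
# the :.2f format spec works identically inside f-strings
print(f"For only {price:.2f} dollars!")  # For only 49.00 dollars!
```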
list_1 = ['one', 'two', 'three']
list_1.append(4)
list_1.insert(0, 'ZERO')
print(list_1)
['ZERO', 'one', 'two', 'three', 4]
list_2 = [1, 2, 3]
list_1.extend(list_2)
print(list_1)
['ZERO', 'one', 'two', 'three', 4, 1, 2, 3]
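Lists also support slicing and negative indices; a short sketch on the list built above:

```python
nums = ['ZERO', 'one', 'two', 'three', 4, 1, 2, 3]
print(nums[1:4])  # ['one', 'two', 'three'] -- right endpoint excluded
print(nums[-1])   # 3 -- negative indices count from the end
print(nums[::2])  # every second element: ['ZERO', 'two', 4, 2]
```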
long_list = [i for i in range(9)]
long_long_list = [(i, j) for i in range(3)
for j in range(5)]
print(long_long_list)
long_list_list = [[i for i in range(3)]
for _ in range(5)]
print(long_list)
print(long_list_list)
[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4)]
[0, 1, 2, 3, 4, 5, 6, 7, 8]
[[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2]]
random_list_2 = [(3, 'z'), (12, 'r'), (6, 'e'),
(8, 'c'), (2, 'g')]
# sorted() returns a new sorted list; it does not modify the original
sorted(random_list_2, key=lambda x: x[1])
print(random_list_2)
[(3, 'z'), (12, 'r'), (6, 'e'), (8, 'c'), (2, 'g')]
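To sort the list in place instead, use list.sort(), which mutates the list and returns None:

```python
random_list_2 = [(3, 'z'), (12, 'r'), (6, 'e'), (8, 'c'), (2, 'g')]
# sort in place by the second tuple element
random_list_2.sort(key=lambda x: x[1])
print(random_list_2)  # [(8, 'c'), (6, 'e'), (2, 'g'), (12, 'r'), (3, 'z')]
```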
my_set = {i ** 2 for i in range(10)}
print(my_set)
{0, 1, 64, 4, 36, 9, 16, 49, 81, 25}
my_dict = {(5-i): i ** 2 for i in range(10)}
print(my_dict)
{5: 0, 4: 1, 3: 4, 2: 9, 1: 16, 0: 25, -1: 36, -2: 49, -3: 64, -4: 81}
second_dict = {'a': 10, 'b': 11}
my_dict.update(second_dict)
print(my_dict)
{5: 0, 4: 1, 3: 4, 2: 9, 1: 16, 0: 25, -1: 36, -2: 49, -3: 64, -4: 81, 'a': 10, 'b': 11}
for k, it in my_dict.items():
    print(k, it)
5 0
4 1
3 4
2 9
1 16
0 25
-1 36
-2 49
-3 64
-4 81
a 10
b 11
Numpy
| Python Command | Description |
|---|---|
| np.linalg.inv | Inverse of a matrix (see also np.linalg.pinv) |
| np.linalg.eig | Get eigenvalues & eigenvectors of an array |
| np.matmul | Matrix multiplication |
| np.zeros/ones | Create a matrix filled with zeros/ones |
| np.arange | Evenly spaced values given start, stop, step (see also np.linspace) |
| np.identity | Create an identity matrix |
| np.vstack | Vertically stack 2 arrays (see also np.hstack) |
| Python Command | Description |
|---|---|
| array.shape | Get shape of numpy array |
| array.dtype | Check the data type of an array (useful for debugging precision issues or unexpected behavior) |
| type(stuff) | Get type of a variable |
| import pdb; pdb.set_trace() | Set a breakpoint |
import numpy as np
array_1d = np.array([1, 2, 3, 4])
print(array_1d)
array_1by4 = np.array([[1, 2, 3, 4]])
print(array_1by4)
large_array = np.array([i for i in range(36)])
print(large_array)
large_array = large_array.reshape((6, 6))
print(large_array)
[1 2 3 4]
[[1 2 3 4]]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35]
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 31 32 33 34 35]]
from_list = np.array([1, 2, 3])
from_list_2d = np.array([[1, 2, 3.0], [4, 5, 6]])
from_list_bad_type = np.array([1, 2, 3, 'a'])  # mixed types are coerced to a common type (here, strings)
print(f'Data type of integer is {from_list.dtype}')
print(f'Data type of float is {from_list_2d.dtype}')
Data type of integer is int64
Data type of float is float64
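Data types can also be converted explicitly with astype; a small sketch:

```python
import numpy as np

a = np.array([1, 2, 3])
a_float = a.astype(np.float64)
print(a_float.dtype)  # float64

# converting floats to ints truncates toward zero
truncated = np.array([1.7, 2.9]).astype(np.int64)
print(truncated)  # [1 2]
```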
array_1 = np.array([1, 2, 3])
array_1 + 5
array_1 * 5
np.sqrt(array_1)
np.power(array_1, 2)
np.exp(array_1)
np.log(array_1)
array([0. , 0.69314718, 1.09861229])
array_1 = np.array([1, 2, 3])
array_2 = array_1
array_1 @ array_2
array_1.dot(array_2)
np.dot(array_1, array_2)
14
weight_matrix = np.array([1, 2, 3, 4]).reshape(2, 2)
sample = np.array([[50, 60]]).T
np.matmul(weight_matrix, sample)
array([[170],
[390]])
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
np.matmul(mat1, mat2)
array([[19, 22],
[43, 50]])
a = np.array([i for i in range(10)]).reshape(2, 5)
a * a
np.multiply(a, a)
np.multiply(a, 10)
array([[ 0, 10, 20, 30, 40],
[50, 60, 70, 80, 90]])
NumPy compares the shapes of the operands elementwise, starting from the trailing dimensions; two dimensions are compatible when they are equal or one of them is 1, in which case that dimension is stretched to match the other. Be careful with dimensions!
op1 = np.array([i for i in range(9)]).reshape(3, 3)
op2 = np.array([[1, 2, 3]])
op3 = np.array([1, 2, 3])
# Results are different here
print(op1 + op2)
print(op1 + op2.T)
[[ 1  3  5]
 [ 4  6  8]
 [ 7  9 11]]
[[ 1  2  3]
 [ 5  6  7]
 [ 9 10 11]]
# Results are same here
print(op1 + op3)
print(op1 + op3.T)
[[ 1  3  5]
 [ 4  6  8]
 [ 7  9 11]]
[[ 1  3  5]
 [ 4  6  8]
 [ 7  9 11]]
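The compatibility rule can be checked without running the operation: np.broadcast_shapes (available in NumPy 1.20+) returns the resulting shape, or raises an error for incompatible shapes:

```python
import numpy as np

print(np.broadcast_shapes((3, 3), (1, 3)))  # (3, 3): row vector stretched down
print(np.broadcast_shapes((3, 3), (3, 1)))  # (3, 3): column vector stretched across
print(np.broadcast_shapes((3, 3), (3,)))    # (3, 3): a 1D array acts like a row

try:
    np.broadcast_shapes((3, 3), (2,))       # trailing dims 3 vs 2: incompatible
except ValueError as e:
    print("incompatible:", e)
```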
samples = np.random.random((3, 5))
# Without broadcasting
expanded1 = np.expand_dims(samples, axis=1)
tile1 = np.tile(expanded1, (1, samples.shape[0], 1))
expanded2 = np.expand_dims(samples, axis=0)
tile2 = np.tile(expanded2, (samples.shape[0], 1, 1))
diff = tile2 - tile1
distances = np.linalg.norm(diff, axis=-1)
print(distances)
# With broadcasting
diff = (samples[:, np.newaxis, :]
        - samples[np.newaxis, :, :])
distances = np.linalg.norm(diff, axis=-1)
print(distances)
# Also could use scipy
import scipy.spatial
distances = scipy.spatial.distance.cdist(samples, samples)
print(distances)
[[0.         0.497356   1.07883004]
 [0.497356   0.         1.087224  ]
 [1.07883004 1.087224   0.        ]]
[[0.         0.497356   1.07883004]
 [0.497356   0.         1.087224  ]
 [1.07883004 1.087224   0.        ]]
[[0.         0.497356   1.07883004]
 [0.497356   0.         1.087224  ]
 [1.07883004 1.087224   0.        ]]
Shorter code, faster execution! Look at these examples.
import time
a = np.random.random(500000)
b = np.random.random(500000)
# Using NumPy dot product
t = time.time()
dot = np.array(a).dot(np.array(b))
print("Execution time with numpy: " + str(time.time() - t))
print(dot)
# Using for loops
dot = 0.0
t = time.time()
for i in range(len(a)):
    dot += a[i] * b[i]
print("Execution time with for loop: " + str(time.time() - t))
print(dot)
Execution time with numpy: 0.006165742874145508
124656.76023488915
Execution time with for loop: 0.2621474266052246
124656.7602348879
The speed-up depends on the setup and the nature of the computation!
samples = np.random.random((100, 5))
# Using NumPy with broadcasting
t = time.time()
diff = samples[:, np.newaxis, :] - samples[np.newaxis, :, :]
distances = np.linalg.norm(diff, axis=-1)
avg_dist = np.mean(distances)
print("Execution time with numpy: " + str(time.time() - t))
print(avg_dist)
# Using for loops
t = time.time()
total_dist = []
for s1 in samples:
    for s2 in samples:
        d = np.linalg.norm(s1 - s2)
        total_dist.append(d)
avg_dist = np.mean(total_dist)
print("Execution time with for loop: " + str(time.time() - t))
print(avg_dist)
Execution time with numpy: 0.001271963119506836
0.8475971610622879
Execution time with for loop: 0.06235051155090332
0.8475971610622879
Tools for Plotting
Matplotlib is mostly used for visualization (line, scatter, and bar plots, images, and even interactive 3D).
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
# Data for plotting
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)
# Plotting
fig, ax = plt.subplots()
ax.plot(t, s)
# Format plotting
ax.set(xlabel='time (s)', ylabel='voltage (mV)',
title='About as simple as it gets, folks')
ax.grid()
# Save/show
fig.savefig("test.png")
plt.show()
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 500)
y = np.sin(x)
fig, ax = plt.subplots()
line1, = ax.plot(x, y, label='Using set_dashes()')
# 2pt line, 2pt break, 10pt line, 2pt break
line1.set_dashes([2, 2, 10, 2])
line2, = ax.plot(x, y - 0.2, dashes=[6, 2],
label='Using the dashes parameter')
ax.legend()
plt.show()
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)
# Set up a subplot grid with 2 rows and 1 column.
# Plot the 1st subplot
plt.subplot(2, 1, 1)
plt.grid()
plt.plot(x, y_sin)
plt.title('Sine Wave')
# Now plot on the 2nd subplot
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine Wave')
plt.grid()
plt.tight_layout()
scikit-learn also provides plotting helpers, used for example for confusion matrices and ROC curves.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel="linear", C=0.01).fit(X_train, y_train)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
titles_options = [
("Confusion matrix, without normalization", None),
("Normalized confusion matrix", "true"),
]
for title, normalize in titles_options:
    disp = ConfusionMatrixDisplay.from_estimator(
        classifier,
        X_test,
        y_test,
        display_labels=class_names,
        cmap=plt.cm.Blues,
        normalize=normalize,
    )
    disp.ax_.set_title(title)
    print(title)
    print(disp.confusion_matrix)
plt.show()
Confusion matrix, without normalization
[[13  0  0]
 [ 0 10  6]
 [ 0  0  9]]
Normalized confusion matrix
[[1.   0.   0.  ]
 [0.   0.62 0.38]
 [0.   0.   1.  ]]
Pandas
Pandas is a Python library that provides extensive means for data analysis. Data scientists often work with data stored in table formats like .csv, .tsv, or .xlsx. Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with Matplotlib and Seaborn, Pandas provides a wide range of opportunities for visual analysis of tabular data.
The main data structures in Pandas are implemented with Series and DataFrame classes. The former is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data structure - a table - where each column contains data of the same type. You can see it as a dictionary of Series instances. DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
import numpy as np
import pandas as pd
pd.set_option("display.precision", 2)
df = pd.read_csv("churn-bigml-80.csv")
df.head()
| State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | No | Yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
| 1 | OH | 107 | 415 | No | Yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
| 2 | NJ | 137 | 415 | No | No | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | Yes | No | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | Yes | No | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
Let’s have a look at data dimensionality, feature names, and feature types.
print(df.shape)
(3333, 20)
From the output, we can see that the table contains 3333 rows and 20 columns.
print(df.columns)
Index(['State', 'Account length', 'Area code', 'International plan',
'Voice mail plan', 'Number vmail messages', 'Total day minutes',
'Total day calls', 'Total day charge', 'Total eve minutes',
'Total eve calls', 'Total eve charge', 'Total night minutes',
'Total night calls', 'Total night charge', 'Total intl minutes',
'Total intl calls', 'Total intl charge', 'Customer service calls',
'Churn'],
dtype='object')
We can use the info() method to output some general information about the dataframe:
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   State                   3333 non-null   object
 1   Account length          3333 non-null   int64
 2   Area code               3333 non-null   int64
 3   International plan      3333 non-null   object
 4   Voice mail plan         3333 non-null   object
 5   Number vmail messages   3333 non-null   int64
 6   Total day minutes       3333 non-null   float64
 7   Total day calls         3333 non-null   int64
 8   Total day charge        3333 non-null   float64
 9   Total eve minutes       3333 non-null   float64
 10  Total eve calls         3333 non-null   int64
 11  Total eve charge        3333 non-null   float64
 12  Total night minutes     3333 non-null   float64
 13  Total night calls       3333 non-null   int64
 14  Total night charge      3333 non-null   float64
 15  Total intl minutes      3333 non-null   float64
 16  Total intl calls        3333 non-null   int64
 17  Total intl charge       3333 non-null   float64
 18  Customer service calls  3333 non-null   int64
 19  Churn                   3333 non-null   bool
dtypes: bool(1), float64(8), int64(8), object(3)
memory usage: 498.1+ KB
None
bool, int64, float64 and object are the data types of our features. We see that one feature is logical (bool), 3 features are of type object, and 16 features are numeric. With this same method, we can easily see if there are any missing values. Here, there are none because each column contains 3333 observations, the same number of rows we saw before with shape.
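Missing values can also be counted explicitly with isnull(). A sketch on a small synthetic frame (standing in for the churn data, which may not be available here):

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with two missing cells
toy = pd.DataFrame({"a": [1, 2, np.nan], "b": ["x", None, "z"]})

missing_per_column = toy.isnull().sum()
print(missing_per_column)        # per-column count of missing values
print(missing_per_column.sum())  # total number of missing cells
```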
We can change the column type with the astype method. Let’s apply this method to the Churn feature to convert it into int64:
df["Churn"] = df["Churn"].astype("int64")
The describe method shows basic statistical characteristics of each numerical feature (int64 and float64 types): number of non-missing values, mean, standard deviation, range, median, 0.25 and 0.75 quartiles.
df.describe()
| Account length | Area code | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.00 | 3333.00 | 3333.00 | 3333.00 | 3333.00 | 3333.00 | 3333.00 | 3333.00 | 3333.00 | 3333.00 | 3333.00 | 3333.00 | 3333.00 | 3333.00 | 3333.00 | 3333.00 | 3333.00 |
| mean | 101.06 | 437.18 | 8.10 | 179.78 | 100.44 | 30.56 | 200.98 | 100.11 | 17.08 | 200.87 | 100.11 | 9.04 | 10.24 | 4.48 | 2.76 | 1.56 | 0.14 |
| std | 39.82 | 42.37 | 13.69 | 54.47 | 20.07 | 9.26 | 50.71 | 19.92 | 4.31 | 50.57 | 19.57 | 2.28 | 2.79 | 2.46 | 0.75 | 1.32 | 0.35 |
| min | 1.00 | 408.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 23.20 | 33.00 | 1.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 74.00 | 408.00 | 0.00 | 143.70 | 87.00 | 24.43 | 166.60 | 87.00 | 14.16 | 167.00 | 87.00 | 7.52 | 8.50 | 3.00 | 2.30 | 1.00 | 0.00 |
| 50% | 101.00 | 415.00 | 0.00 | 179.40 | 101.00 | 30.50 | 201.40 | 100.00 | 17.12 | 201.20 | 100.00 | 9.05 | 10.30 | 4.00 | 2.78 | 1.00 | 0.00 |
| 75% | 127.00 | 510.00 | 20.00 | 216.40 | 114.00 | 36.79 | 235.30 | 114.00 | 20.00 | 235.30 | 113.00 | 10.59 | 12.10 | 6.00 | 3.27 | 2.00 | 0.00 |
| max | 243.00 | 510.00 | 51.00 | 350.80 | 165.00 | 59.64 | 363.70 | 170.00 | 30.91 | 395.00 | 175.00 | 17.77 | 20.00 | 20.00 | 5.40 | 9.00 | 1.00 |
For categorical (type object) and boolean (type bool) features we can use the value_counts method. Let’s take a look at the distribution of Churn:
df["Churn"].value_counts()
0    2850
1     483
Name: Churn, dtype: int64
df["Churn"].value_counts(normalize=True)
0    0.86
1    0.14
Name: Churn, dtype: float64
A DataFrame can be sorted by the values of one of its variables (i.e., columns). For example, we can sort by Total day charge (use ascending=False to sort in descending order):
df.sort_values(by="Total day charge", ascending=False).head()
| State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 296 | CO | 154 | 415 | No | No | 0 | 350.8 | 75 | 59.64 | 216.5 | 94 | 18.40 | 253.9 | 100 | 11.43 | 10.1 | 9 | 2.73 | 1 | 1 |
| 780 | NY | 64 | 415 | Yes | No | 0 | 346.8 | 55 | 58.96 | 249.5 | 79 | 21.21 | 275.4 | 102 | 12.39 | 13.3 | 9 | 3.59 | 1 | 1 |
| 2087 | OH | 115 | 510 | Yes | No | 0 | 345.3 | 81 | 58.70 | 203.4 | 106 | 17.29 | 217.5 | 107 | 9.79 | 11.8 | 8 | 3.19 | 1 | 1 |
| 128 | OH | 83 | 415 | No | No | 0 | 337.4 | 120 | 57.36 | 227.4 | 116 | 19.33 | 153.9 | 114 | 6.93 | 15.8 | 7 | 4.27 | 0 | 1 |
| 485 | MO | 112 | 415 | No | No | 0 | 335.5 | 77 | 57.04 | 212.5 | 109 | 18.06 | 265.0 | 132 | 11.93 | 12.7 | 8 | 3.43 | 2 | 1 |
df.sort_values(by=["Churn", "Total day charge"], ascending=[True, False]).head()
| State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2812 | MN | 13 | 510 | No | Yes | 21 | 315.6 | 105 | 53.65 | 208.9 | 71 | 17.76 | 260.1 | 123 | 11.70 | 12.1 | 3 | 3.27 | 3 | 0 |
| 1818 | NC | 210 | 415 | No | Yes | 31 | 313.8 | 87 | 53.35 | 147.7 | 103 | 12.55 | 192.7 | 97 | 8.67 | 10.1 | 7 | 2.73 | 3 | 0 |
| 2770 | LA | 67 | 510 | No | No | 0 | 310.4 | 97 | 52.77 | 66.5 | 123 | 5.65 | 246.5 | 99 | 11.09 | 9.2 | 10 | 2.48 | 4 | 0 |
| 460 | SD | 114 | 415 | No | Yes | 36 | 309.9 | 90 | 52.68 | 200.3 | 89 | 17.03 | 183.5 | 105 | 8.26 | 14.2 | 2 | 3.83 | 1 | 0 |
| 2302 | AL | 141 | 510 | No | Yes | 28 | 308.0 | 123 | 52.36 | 247.8 | 128 | 21.06 | 152.9 | 103 | 6.88 | 7.4 | 3 | 2.00 | 1 | 0 |
A DataFrame can be indexed in a few different ways.
To get a single column, you can use a DataFrame['Name'] construction. Let’s use this to answer a question about that column alone: what is the proportion of churned users in our dataframe?
df["Churn"].mean()
0.14491449144914492
Boolean indexing with one column is also very convenient. The syntax is df[P(df['Name'])], where P is some logical condition that is checked for each element of the Name column. The result of such indexing is the DataFrame consisting only of rows that satisfy the P condition on the Name column.
Let’s use it to answer the question:
What are average values of numerical features for churned users?
df[df["Churn"] == 1].mean(numeric_only=True)
Account length            102.66
Area code                 437.82
Number vmail messages       5.12
Total day minutes         206.91
Total day calls           101.34
Total day charge           35.18
Total eve minutes         212.41
Total eve calls           100.56
Total eve charge           18.05
Total night minutes       205.23
Total night calls         100.40
Total night charge          9.24
Total intl minutes         10.70
Total intl calls            4.16
Total intl charge           2.89
Customer service calls      2.23
Churn                       1.00
dtype: float64
How much time (on average) do churned users spend on the phone during daytime?
df[df["Churn"] == 1]["Total day minutes"].mean()
206.91407867494823
What is the maximum length of international calls among loyal users (Churn == 0) who do not have an international plan?
df[(df["Churn"] == 0) & (df["International plan"] == "No")]["Total intl minutes"].max()
18.9
The loc method indexes by label; note that, unlike standard Python slicing, both endpoints of a loc slice are inclusive:
df.loc[0:5, "State":"Area code"]
| State | Account length | Area code | |
|---|---|---|---|
| 0 | KS | 128 | 415 |
| 1 | OH | 107 | 415 |
| 2 | NJ | 137 | 415 |
| 3 | OH | 84 | 408 |
| 4 | OK | 75 | 415 |
| 5 | AL | 118 | 510 |
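By contrast, iloc selects by integer position and, like standard Python slicing, excludes the right endpoint. A sketch on a tiny synthetic frame:

```python
import pandas as pd

toy = pd.DataFrame({"State": ["KS", "OH", "NJ"],
                    "Account length": [128, 107, 137]})

# first two rows, first column only; the right endpoints are excluded
subset = toy.iloc[0:2, 0:1]
print(subset)
```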
The apply method applies a function to each column by default:
df.apply(np.max)
State                        WY
Account length              243
Area code                   510
International plan          Yes
Voice mail plan             Yes
Number vmail messages        51
Total day minutes         350.8
Total day calls             165
Total day charge          59.64
Total eve minutes         363.7
Total eve calls             170
Total eve charge          30.91
Total night minutes       395.0
Total night calls           175
Total night charge        17.77
Total intl minutes         20.0
Total intl calls             20
Total intl charge           5.4
Customer service calls        9
Churn                         1
dtype: object
The apply method can also be used to apply a function to each row. To do this, specify axis=1. Lambda functions are very convenient in such scenarios. For example, if we need to select all states starting with ‘W’, we can do it like this:
df[df["State"].apply(lambda state: state[0] == "W")].head()
| State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | WV | 141 | 415 | Yes | Yes | 37 | 258.6 | 84 | 43.96 | 222.0 | 111 | 18.87 | 326.4 | 97 | 14.69 | 11.2 | 5 | 3.02 | 0 | 0 |
| 22 | WY | 57 | 408 | No | Yes | 39 | 213.0 | 115 | 36.21 | 191.1 | 112 | 16.24 | 182.7 | 115 | 8.22 | 9.5 | 3 | 2.57 | 0 | 0 |
| 38 | WI | 64 | 510 | No | No | 0 | 154.0 | 67 | 26.18 | 225.8 | 118 | 19.19 | 265.3 | 86 | 11.94 | 3.5 | 3 | 0.95 | 1 | 0 |
| 41 | WY | 97 | 415 | No | Yes | 24 | 133.2 | 135 | 22.64 | 217.2 | 58 | 18.46 | 70.6 | 79 | 3.18 | 11.0 | 3 | 2.97 | 1 | 0 |
| 45 | WY | 87 | 415 | No | No | 0 | 151.0 | 83 | 25.67 | 219.7 | 116 | 18.67 | 203.9 | 127 | 9.18 | 9.7 | 3 | 2.62 | 5 | 1 |
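The same filter can be written with the vectorized .str accessor, which is usually more idiomatic than apply for string operations:

```python
import pandas as pd

toy = pd.DataFrame({"State": ["WV", "KS", "WY", "OH"]})

# vectorized string method: no Python-level lambda needed
w_states = toy[toy["State"].str.startswith("W")]
print(w_states)
```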
The map method can be used to replace values in a column by passing a dictionary of the form {old_value: new_value} as its argument:
d = {"No": False, "Yes": True}
df["International plan"] = df["International plan"].map(d)
df.head()
| State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | False | Yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | 0 |
| 1 | OH | 107 | 415 | False | Yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | 0 |
| 2 | NJ | 137 | 415 | False | No | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | 0 |
| 3 | OH | 84 | 408 | True | No | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | 0 |
| 4 | OK | 75 | 415 | True | No | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | 0 |
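DataFrame.replace achieves a similar substitution; the key difference is that map turns values missing from the dictionary into NaN, while replace keeps them unchanged. A sketch:

```python
import pandas as pd

s = pd.Series(["No", "Yes", "Maybe"])

mapped = s.map({"No": False, "Yes": True})        # "Maybe" becomes NaN
replaced = s.replace({"No": False, "Yes": True})  # "Maybe" is kept as-is

print(mapped.tolist())
print(replaced.tolist())
```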
In general, grouping data in Pandas works as follows:
df.groupby(by=grouping_columns)[columns_to_show].function()
First, the groupby method divides the data into groups according to the values of the grouping_columns; those values become the index of the resulting dataframe.
Then, the columns of interest are selected (columns_to_show). If columns_to_show is omitted, all columns except the grouping columns are included.
Finally, one or several functions are applied to the obtained groups per selected columns.
columns_to_show = ["Total day minutes", "Total eve minutes", "Total night minutes"]
df.groupby(["Churn"])[columns_to_show].describe(percentiles=[])
| Total day minutes | Total eve minutes | Total night minutes | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 50% | max | count | mean | std | min | 50% | max | count | mean | std | min | 50% | max | |
| Churn | ||||||||||||||||||
| 0 | 2850.0 | 175.18 | 50.18 | 0.0 | 177.2 | 315.6 | 2850.0 | 199.04 | 50.29 | 0.0 | 199.6 | 361.8 | 2850.0 | 200.13 | 51.11 | 23.2 | 200.25 | 395.0 |
| 1 | 483.0 | 206.91 | 69.00 | 0.0 | 217.6 | 350.8 | 483.0 | 212.41 | 51.73 | 70.9 | 211.3 | 363.7 | 483.0 | 205.23 | 47.13 | 47.4 | 204.80 | 354.9 |
columns_to_show = ["Total day minutes", "Total eve minutes", "Total night minutes"]
df.groupby(["Churn"])[columns_to_show].agg([np.mean, np.std, np.min, np.max])
| Total day minutes | Total eve minutes | Total night minutes | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | std | amin | amax | mean | std | amin | amax | mean | std | amin | amax | |
| Churn | ||||||||||||
| 0 | 175.18 | 50.18 | 0.0 | 315.6 | 199.04 | 50.29 | 0.0 | 361.8 | 200.13 | 51.11 | 23.2 | 395.0 |
| 1 | 206.91 | 69.00 | 0.0 | 350.8 | 212.41 | 51.73 | 70.9 | 363.7 | 205.23 | 47.13 | 47.4 | 354.9 |
Like many other things in Pandas, adding columns to a DataFrame is doable in many ways.
For example, if we want to calculate the total number of calls for all users, let’s create the total_calls Series and paste it into the DataFrame:
total_calls = (
df["Total day calls"]
+ df["Total eve calls"]
+ df["Total night calls"]
+ df["Total intl calls"]
)
df.insert(loc=len(df.columns), column="Total calls", value=total_calls)
# the loc parameter is the position (column index) at which to insert the new column
# we set it to len(df.columns) to append it at the very end of the dataframe
df.head()
| State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | ... | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn | Total calls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | False | Yes | 25 | 265.1 | 110 | 45.07 | 197.4 | ... | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | 0 | 303 |
| 1 | OH | 107 | 415 | False | Yes | 26 | 161.6 | 123 | 27.47 | 195.5 | ... | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | 0 | 332 |
| 2 | NJ | 137 | 415 | False | No | 0 | 243.4 | 114 | 41.38 | 121.2 | ... | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | 0 | 333 |
| 3 | OH | 84 | 408 | True | No | 0 | 299.4 | 71 | 50.90 | 61.9 | ... | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | 0 | 255 |
| 4 | OK | 75 | 415 | True | No | 0 | 166.7 | 113 | 28.34 | 148.3 | ... | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | 0 | 359 |
5 rows × 21 columns
df["Total charge"] = (
df["Total day charge"]
+ df["Total eve charge"]
+ df["Total night charge"]
+ df["Total intl charge"]
)
df.head()
| State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | ... | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn | Total calls | Total charge | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | False | Yes | 25 | 265.1 | 110 | 45.07 | 197.4 | ... | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | 0 | 303 | 75.56 |
| 1 | OH | 107 | 415 | False | Yes | 26 | 161.6 | 123 | 27.47 | 195.5 | ... | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | 0 | 332 | 59.24 |
| 2 | NJ | 137 | 415 | False | No | 0 | 243.4 | 114 | 41.38 | 121.2 | ... | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | 0 | 333 | 62.29 |
| 3 | OH | 84 | 408 | True | No | 0 | 299.4 | 71 | 50.90 | 61.9 | ... | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | 0 | 255 | 66.80 |
| 4 | OK | 75 | 415 | True | No | 0 | 166.7 | 113 | 28.34 | 148.3 | ... | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | 0 | 359 | 52.09 |
5 rows × 22 columns
To delete columns or rows, use the drop method, passing the required indexes and the axis parameter (1 if you delete columns, and nothing or 0 if you delete rows). The inplace argument tells whether to change the original DataFrame. With inplace=False, the drop method doesn’t change the existing DataFrame and returns a new one with dropped rows or columns. With inplace=True, it alters the DataFrame.
# get rid of just created columns
df.drop(["Total charge", "Total calls"], axis=1, inplace=True)
# and here’s how you can delete rows
df.drop([1, 2]).head()
| State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | False | Yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | 0 |
| 3 | OH | 84 | 408 | True | No | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | 0 |
| 4 | OK | 75 | 415 | True | No | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | 0 |
| 5 | AL | 118 | 510 | True | No | 0 | 223.4 | 98 | 37.98 | 220.6 | 101 | 18.75 | 203.9 | 118 | 9.18 | 6.3 | 6 | 1.70 | 0 | 0 |
| 6 | MA | 121 | 510 | False | Yes | 24 | 218.2 | 88 | 37.09 | 348.5 | 108 | 29.62 | 212.6 | 118 | 9.57 | 7.5 | 7 | 2.03 | 3 | 0 |
pd.crosstab(df["Churn"], df["International plan"], margins=True)
| International plan | False | True | All |
|---|---|---|---|
| Churn | |||
| 0 | 2664 | 186 | 2850 |
| 1 | 346 | 137 | 483 |
| All | 3010 | 323 | 3333 |
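Summary tables can also be built with pivot_table, which aggregates a value column over row and column groupings (a sketch on synthetic data; the column names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Churn": [0, 0, 1, 1, 0],
    "Plan": ["No", "Yes", "No", "Yes", "No"],
    "Minutes": [100.0, 200.0, 150.0, 250.0, 120.0],
})

# mean Minutes for every (Churn, Plan) combination
table = pd.pivot_table(toy, values="Minutes", index="Churn",
                       columns="Plan", aggfunc="mean")
print(table)
```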
import matplotlib.pyplot as plt
import seaborn as sns
# import some nice vis settings
sns.set()
# Graphics in the Retina format are sharper and more legible
%config InlineBackend.figure_format = 'retina'
sns.countplot(x="International plan", hue="Churn", data=df);
We see that the churn rate is much higher for customers with an International Plan, which is an interesting observation! Perhaps large, poorly controlled expenses on international calls lead to conflicts and dissatisfaction among the telecom operator's customers.
Next, let’s look at another important feature – Customer service calls. Let’s also make a summary table and a picture.
pd.crosstab(df["Churn"], df["Customer service calls"], margins=True)
| Customer service calls | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | All |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Churn | |||||||||||
| 0 | 605 | 1059 | 672 | 385 | 90 | 26 | 8 | 4 | 1 | 0 | 2850 |
| 1 | 92 | 122 | 87 | 44 | 76 | 40 | 14 | 5 | 1 | 2 | 483 |
| All | 697 | 1181 | 759 | 429 | 166 | 66 | 22 | 9 | 2 | 2 | 3333 |
sns.countplot(x="Customer service calls", hue="Churn", data=df);
Although it’s not so obvious from the summary table, it’s easy to see from the above plot that the churn rate increases sharply from 4 customer service calls and above.
Now let’s add a binary feature to our DataFrame – Customer service calls > 3. And once again, let’s see how it relates to churn.
df["Many_service_calls"] = (df["Customer service calls"] > 3).astype("int")
pd.crosstab(df["Many_service_calls"], df["Churn"], margins=True)
| Churn | 0 | 1 | All |
|---|---|---|---|
| Many_service_calls | |||
| 0 | 2721 | 345 | 3066 |
| 1 | 129 | 138 | 267 |
| All | 2850 | 483 | 3333 |
sns.countplot(x="Many_service_calls", hue="Churn", data=df);
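Because Churn is a 0/1 variable, the per-group churn rate is just the group mean, which makes the contrast in the table above easy to quantify. A small sketch on made-up rows (the column names match this dataset, the data does not):

```python
import pandas as pd

# Made-up sample mimicking the two columns used above
toy = pd.DataFrame({
    "Customer service calls": [0, 1, 2, 4, 5, 6],
    "Churn": [0, 0, 1, 1, 1, 0],
})
toy["Many_service_calls"] = (toy["Customer service calls"] > 3).astype("int")

# Mean of a 0/1 target per group is exactly the churn rate for that group
churn_rate = toy.groupby("Many_service_calls")["Churn"].mean()
print(churn_rate)
```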
Data Visualization
Univariate analysis looks at one feature at a time. When we analyze a feature independently, we are usually mostly interested in the distribution of its values and ignore other features in the dataset.
Below, we will consider different statistical types of features and the corresponding tools for their individual visual analysis.
Quantitative features take on ordered numerical values. Those values can be discrete, like integers, or continuous, like real numbers, and usually express a count or a measurement.
The easiest way to take a look at the distribution of a numerical variable is to plot its histogram using the DataFrame’s method hist().
features = ["Total day minutes", "Total intl calls"]
df[features].hist(figsize=(10, 4));
A histogram groups values into bins of equal value range. The shape of the histogram may contain clues about the underlying distribution type: Gaussian, exponential, etc. You can also spot any skewness in its shape when the distribution is nearly regular but has some anomalies. Knowing the distribution of the feature values becomes important when you use Machine Learning methods that assume a particular type (most often Gaussian).
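The binning that hist() performs is simple equal-width counting; a quick sketch of the same mechanics with numpy (the values are made up):

```python
import numpy as np

# Ten values; np.histogram splits the min-to-max range into
# bins of equal width, just like DataFrame.hist() does
values = np.array([0.5, 1.2, 1.9, 3.3, 4.4, 4.9, 7.1, 8.8, 9.0, 9.9])
counts, edges = np.histogram(values, bins=5)
print(counts)  # how many values fall in each bin
print(edges)   # the 6 boundaries of the 5 bins
```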
In the above plot, we see that the variable Total day minutes is approximately normally distributed, while Total intl calls is prominently right-skewed (its tail is longer on the right).
There is also another, often clearer, way to grasp the distribution: density plots or, more formally, Kernel Density Plots. They can be considered a smoothed version of the histogram. Their main advantage over the latter is that they do not depend on the size of the bins. Let’s create density plots for the same two variables:
df[features].plot(
kind="density", subplots=True, layout=(1, 2), sharex=False, figsize=(10, 4)
);
It is also possible to plot the distribution of observations with seaborn. For example, let’s look at the distribution of Total intl calls. The older distplot() function is deprecated in favor of histplot() (an axes-level function) and displot() (a figure-level one); with kde=True, histplot() displays the histogram with the kernel density estimate (KDE) on top.
sns.histplot(df["Total intl calls"], kde=True);
Another useful type of visualization is a box plot. seaborn does a great job here:
sns.boxplot(x="Total intl calls", data=df);
Let’s see how to interpret a box plot. Its components are a box (obviously, this is why it is called a box plot), the so-called whiskers, and a number of individual points (outliers).
The box by itself illustrates the interquartile spread of the distribution; its length is determined by the 25th (Q1) and 75th (Q3) percentiles. The vertical line inside the box marks the median (the 50th percentile) of the distribution.
The whiskers are the lines extending from the box. They represent the entire scatter of data points, specifically the points that fall within the interval (Q1 − 1.5·IQR, Q3 + 1.5·IQR), where IQR = Q3 − Q1 is the interquartile range.
Outliers that fall outside of the range bounded by the whiskers are plotted individually as black points along the central axis.
We can see that a large number of international calls is quite rare in our data.
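The whisker rule described above is easy to reproduce by hand; a minimal sketch on a made-up series with one obvious outlier:

```python
import pandas as pd

# Toy data: nine small values and one extreme one
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whisker interval are the ones a box plot
# draws individually as outliers
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())
```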
The last type of distribution plots that we will consider is a violin plot.
Look at the figures below. On the left, we see the already familiar box plot. To the right, there is a violin plot with the kernel density estimate on both sides.
_, axes = plt.subplots(1, 2, sharey=True, figsize=(6, 4))
sns.boxplot(data=df["Total intl calls"], ax=axes[0])
sns.violinplot(data=df["Total intl calls"], ax=axes[1]);
The difference between the box and violin plots is that the former illustrates certain statistics concerning individual examples in a dataset while the violin plot concentrates more on the smoothed distribution as a whole.
In our case, the violin plot does not contribute any additional information about the data as everything is clear from the box plot alone.
Categorical features take on a fixed number of values. Each of these values assigns an observation to a corresponding group, known as a category, which reflects some qualitative property of this example. Binary variables are an important special case of categorical variables when the number of possible values is exactly 2. If the values of a categorical variable are ordered, it is called ordinal.
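pandas can encode the ordering of an ordinal variable explicitly via an ordered Categorical; a minimal sketch with hypothetical size labels (the values are made up):

```python
import pandas as pd

# A hypothetical ordinal feature with an explicit category order
sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)
s = pd.Series(sizes)

# The declared order makes comparisons, sorting, min and max meaningful
print(s.min(), s.max())
```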
Let’s check the class balance in our dataset by looking at the distribution of the target variable: the churn rate. First, we will get a frequency table, which shows how frequent each value of the categorical variable is. For this, we will use the value_counts() method:
df["Churn"].value_counts()
0    2850
1     483
Name: Churn, dtype: int64
In our case, the data is not balanced; that is, our two target classes, loyal and disloyal customers, are not represented equally in the dataset. Only a small part of the clients canceled their subscription to the telecom service. As we will see in the following articles, this fact may imply some restrictions on measuring the classification performance, and, in the future, we may want to additionally penalize our model errors in predicting the minority “Churn” class.
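Passing normalize=True to value_counts() turns the raw counts into class shares, which makes the imbalance explicit; a sketch on a made-up 0/1 target:

```python
import pandas as pd

# Toy target column: 8 loyal (0) vs 2 churned (1) customers
churn = pd.Series([0] * 8 + [1] * 2, name="Churn")

# normalize=True returns class proportions instead of raw counts
shares = churn.value_counts(normalize=True)
print(shares)
```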
The bar plot is a graphical representation of the frequency table. The easiest way to create it is to use seaborn’s countplot() function. There is another seaborn function with the somewhat confusing name barplot(); it is mostly used to represent basic statistics of a numerical variable grouped by a categorical feature.
Let’s plot the distributions for two categorical variables:
_, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
sns.countplot(x="Churn", data=df, ax=axes[0])
sns.countplot(x="Customer service calls", data=df, ax=axes[1]);
Multivariate plots allow us to see relationships between two and more different variables, all in one figure. Just as in the case of univariate plots, the specific type of visualization will depend on the types of the variables being analyzed.
Let’s look at the correlations among the numerical variables in our dataset. This information is important to know as there are Machine Learning algorithms (for example, linear and logistic regression) that do not handle highly correlated input variables well.
First, we will use the method corr() on a DataFrame that calculates the correlation between each pair of features. Then, we pass the resulting correlation matrix to heatmap() from seaborn, which renders a color-coded matrix for the provided values:
# Drop non-numerical variables
numerical = list(
set(df.columns)
- set(
[
"State",
"International plan",
"Voice mail plan",
"Area code",
"Churn",
"Customer service calls",
]
)
)
# Calculate and plot
corr_matrix = df[numerical].corr()
sns.heatmap(corr_matrix);
From the colored correlation matrix generated above, we can see that there are four variables, such as Total day charge, that have been calculated directly from the number of minutes spent on phone calls (Total day minutes). These are called dependent variables and can be left out since they do not contribute any additional information. Let’s get rid of them:
numerical = list(
set(numerical)
- set(
[
"Total day charge",
"Total eve charge",
"Total night charge",
"Total intl charge",
]
)
)
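Dropping the dependent columns by hand works here because we know how the data was built, but highly correlated pairs can also be detected programmatically by scanning the upper triangle of the correlation matrix. A sketch on made-up data (the column names and the 0.9 threshold are illustrative choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
minutes = rng.normal(180, 50, size=200)
toy = pd.DataFrame({
    "minutes": minutes,
    "charge": minutes * 0.17,             # perfectly dependent column
    "calls": rng.integers(50, 150, 200),  # unrelated column
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is reported once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask).stack()
high_pairs = list(upper[upper > 0.9].index)
print(high_pairs)
```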
The scatter plot displays values of two numerical variables as Cartesian coordinates in 2D space. Scatter plots in 3D are also possible.
Let’s try out the function scatter() from the matplotlib library:
plt.scatter(df["Total day minutes"], df["Total night minutes"]);
We get an uninteresting picture of two normally distributed variables. Also, it seems that these features are uncorrelated because the ellipse-like shape is aligned with the axes.
There is a slightly fancier option to create a scatter plot with the seaborn library:
sns.jointplot(x="Total day minutes", y="Total night minutes", data=df, kind="scatter");
In addition to the scatter plot, jointplot() shows the marginal histograms of the two variables, which may be useful in some cases.
Using the same function, we can also get a smoothed version of our bivariate distribution:
sns.jointplot(
    x="Total day minutes", y="Total night minutes", data=df, kind="kde", color="g"
);
In some cases, we may want to plot a scatterplot matrix such as the one shown below. Its diagonal contains the distributions of the corresponding variables, and the scatter plots for each pair of variables fill the rest of the matrix.
# `pairplot()` may become very slow with the SVG format
%config InlineBackend.figure_format = 'png'
sns.pairplot(df[numerical]);
In this section, we will make our simple quantitative plots a little more exciting. We will try to gain new insights for churn prediction from the interactions between the numerical and categorical features.
More specifically, let’s see how the input variables are related to the target variable Churn.
Previously, you learned about scatter plots. Additionally, their points can be color- or size-coded so that the values of a third categorical variable are also presented in the same figure. We can achieve this with the scatter() function seen above, but let’s try a new function called lmplot() and use the parameter hue to indicate our categorical feature of interest:
sns.lmplot(
    x="Total day minutes", y="Total night minutes", data=df, hue="Churn", fit_reg=False
);
As we saw earlier in this article, the variable Customer service calls has few unique values and, thus, can be considered either numerical or ordinal. We have already seen its distribution with a count plot. Now, we are interested in the relationship between this ordinal feature and the target variable Churn.
Let’s look at the distribution of the number of calls to customer service, again using a count plot. This time, let’s also pass the parameter hue="Churn" that adds a categorical dimension to the plot:
sns.countplot(x="Customer service calls", hue="Churn", data=df);
An observation: the churn rate increases significantly after 4 or more calls to customer service.
Now, let’s look at the relationship between Churn and the binary features, International plan and Voice mail plan.
_, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
sns.countplot(x="International plan", hue="Churn", data=df, ax=axes[0])
sns.countplot(x="Voice mail plan", hue="Churn", data=df, ax=axes[1]);
In addition to using graphical means for categorical analysis, there is a traditional tool from statistics: a contingency table, also called a cross tabulation. It shows a multivariate frequency distribution of categorical variables in tabular form. In particular, it allows us to see the distribution of one variable conditional on the other by looking along a column or row.
Let’s try to see how Churn is related to the categorical variable State by creating a cross tabulation:
pd.crosstab(df["State"], df["Churn"]).T
| State | AK | AL | AR | AZ | CA | CO | CT | DC | DE | FL | ... | SD | TN | TX | UT | VA | VT | WA | WI | WV | WY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Churn | |||||||||||||||||||||
| 0 | 49 | 72 | 44 | 60 | 25 | 57 | 62 | 49 | 52 | 55 | ... | 52 | 48 | 54 | 62 | 72 | 65 | 52 | 71 | 96 | 68 |
| 1 | 3 | 8 | 11 | 4 | 9 | 9 | 12 | 5 | 9 | 8 | ... | 8 | 5 | 18 | 10 | 5 | 8 | 14 | 7 | 10 | 9 |
2 rows × 51 columns
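Since Churn is 0/1, a per-state churn rate can also be computed with a groupby and sorted to surface the most problematic states; a sketch on made-up rows (state codes are real, the data is not):

```python
import pandas as pd

# Toy version of the State/Churn columns
toy = pd.DataFrame({
    "State": ["KS", "KS", "OH", "OH", "OH", "TX"],
    "Churn": [0, 1, 0, 0, 1, 1],
})

# Mean of the 0/1 target per state = per-state churn rate
state_rates = toy.groupby("State")["Churn"].mean().sort_values(ascending=False)
print(state_rates)
```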
# Increase the default plot size and set the color scheme
plt.rcParams["figure.figsize"] = (8, 5)
plt.rcParams["image.cmap"] = "viridis"
df = pd.read_csv("Video_Games_Sales").dropna()
print(df.shape)
(6825, 16)
df[[x for x in df.columns if "Sales" in x] + ["Year_of_Release"]].groupby(
"Year_of_Release"
).sum().plot();
Now, let’s move on to the seaborn library. seaborn is essentially a higher-level API built on top of matplotlib. Among other things, it differs from the latter in that it provides more sensible default settings for plotting. Adding import seaborn as sns; sns.set() to your code will make your plots much nicer. This library also contains a set of complex visualization tools that would otherwise (i.e., with bare matplotlib) require quite a large amount of code.
Let’s take a look at the first of such complex plots, a pairwise relationships plot, which creates a matrix of scatter plots by default. This kind of plot helps us visualize the relationship between different variables in a single output.
# `pairplot()` may become very slow with the SVG format
%config InlineBackend.figure_format = 'png'
sns.pairplot(
df[["Global_Sales", "Critic_Score", "Critic_Count", "User_Score", "User_Count"]]
);
As you can see, the distribution histograms lie on the diagonal of the matrix. The remaining charts are scatter plots for the corresponding pairs of features.
It is also possible to plot a distribution of observations with seaborn. For example, let’s look at the distribution of critics’ ratings: Critic_Score. The deprecated distplot() has been replaced by histplot() (axes-level) and displot() (figure-level); with kde=True, histplot() displays a histogram together with the kernel density estimate.
sns.histplot(df["Critic_Score"], kde=True);
To look more closely at the relationship between two numerical variables, you can use joint plot, which is a cross between a scatter plot and histogram. Let’s see how the Critic_Score and User_Score features are related.
sns.jointplot(x="Critic_Score", y="User_Score", data=df, kind="scatter");
Another useful type of plot is a box plot. Let’s compare critics’ ratings for the top 5 biggest gaming platforms.
top_platforms = (
df["Platform"].value_counts().sort_values(ascending=False).head(5).index.values
)
sns.boxplot(
y="Platform",
x="Critic_Score",
data=df[df["Platform"].isin(top_platforms)],
orient="h",
);
The last type of plot that we will cover here is a heat map. A heat map allows you to view the distribution of a numerical variable over two categorical ones. Let’s visualize the total sales of games by genre and gaming platform.
platform_genre_sales = (
df.pivot_table(
index="Platform", columns="Genre", values="Global_Sales", aggfunc=sum
)
.fillna(0)
.applymap(float)
)
sns.heatmap(platform_genre_sales, annot=True, fmt=".1f", linewidths=0.5);
We have examined some visualization tools based on the matplotlib library. However, this is not the only option for plotting in Python. Let’s take a look at the plotly library. Plotly is an open-source library that allows creation of interactive plots within a Jupyter notebook without having to write any JavaScript.
The real beauty of interactive plots is that they provide a user interface for detailed data exploration. For example, you can see exact numerical values by mousing over points, hide uninteresting series from the visualization, zoom in onto a specific part of the plot, etc.
Before we start, let’s import all the necessary modules and initialize plotly by calling the init_notebook_mode() function.
import plotly
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
init_notebook_mode(connected=True)
years_df = (
df.groupby("Year_of_Release")[["Global_Sales"]]
.sum()
.join(df.groupby("Year_of_Release")[["Name"]].count())
)
years_df.columns = ["Global_Sales", "Number_of_Games"]
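The same sum-and-count table can be built in a single groupby with named aggregation, which avoids the join and the manual column renaming; a sketch on made-up rows (column names match the dataset, values do not):

```python
import pandas as pd

toy = pd.DataFrame({
    "Year_of_Release": [2000, 2000, 2001],
    "Global_Sales": [1.5, 0.5, 2.0],
    "Name": ["A", "B", "C"],
})

# Named aggregation: output column = (input column, aggregation function)
years = toy.groupby("Year_of_Release").agg(
    Global_Sales=("Global_Sales", "sum"),
    Number_of_Games=("Name", "count"),
)
print(years)
```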
Figure is the main class and workhorse of visualization in plotly. It consists of the data (an array of lines, called traces in this library) and the style (represented by the layout object). In the simplest case, you may pass only the traces to the iplot function.
The show_link parameter toggles the visibility of the links leading to the online platform plot.ly in your charts. Most of the time, this functionality is not needed, so you may want to turn it off by passing show_link=False to prevent accidental clicks on those links.
# Create a line (trace) for the global sales
trace0 = go.Scatter(x=years_df.index, y=years_df["Global_Sales"], name="Global Sales")
# Create a line (trace) for the number of games released
trace1 = go.Scatter(
x=years_df.index, y=years_df["Number_of_Games"], name="Number of games released"
)
# Define the data array
data = [trace0, trace1]
# Set the title
layout = {"title": "Statistics for video games"}
# Create a Figure and plot it
fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=False)
As an option, you can save the plot in an html file:
plotly.offline.plot(fig, filename="years_stats.html", show_link=False, auto_open=False);
Let’s use a bar chart to compare the market share of different gaming platforms broken down by the number of new releases and by total revenue.
# Do calculations and prepare the dataset
platforms_df = (
df.groupby("Platform")[["Global_Sales"]]
.sum()
.join(df.groupby("Platform")[["Name"]].count())
)
platforms_df.columns = ["Global_Sales", "Number_of_Games"]
platforms_df.sort_values("Global_Sales", ascending=False, inplace=True)
# Create a bar for the global sales
trace0 = go.Bar(
x=platforms_df.index, y=platforms_df["Global_Sales"], name="Global Sales"
)
# Create a bar for the number of games released
trace1 = go.Bar(
x=platforms_df.index,
y=platforms_df["Number_of_Games"],
name="Number of games released",
)
# Get together the data and style objects
data = [trace0, trace1]
layout = {"title": "Market share by gaming platform"}
# Create a `Figure` and plot it
fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=False)
plotly also supports box plots. Let’s consider the distribution of critics’ ratings by the genre of the game.
data = []
# Create a box trace for each genre in our dataset
for genre in df.Genre.unique():
data.append(go.Box(y=df[df.Genre == genre].Critic_Score, name=genre))
# Visualize
iplot(data, show_link=False)
- Stanford Introduction to Python Course
- Intro to Data Visualization, mlcourse.ai